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Mammography is widely recognized as the most reliable technique for early detection of breast cancers. Auto- 
mated or semi-automated computerized classification schemes can be very useful in assisting radiologists with a 
second opinion about the visual diagnosis of breast lesions, thus leading to a reduction in the number of unneces- 
sary biopsies. We present a computer-aided diagnosis (CADi) system for the characterization of massive lesions 
in mammograms, whose aim is to distinguish malignant from benign masses. The CADi system we realized is 
based on a three-stage algorithm: a) a segmentation technique extracts the contours of the massive lesion from 
the image; b) sixteen features based on size and shape of the lesion are computed; c) a neural classifier merges 
the features into an estimated likelihood of malignancy. A dataset of 226 massive lesions (109 malignant and 117 
benign) has been used in this study. The system performances have been evaluated terms of the receiver-operating 
characteristic (ROC) analysis, obtaining Az — 0.80 ± 0.04 as the estimated area under the ROC curve. 
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Introduction 

Breast cancer is still one of the main causes 
of death among women, despite early detections 
have recently contributed to a significant decrease 
in the breast-cancer mortality [1|2| . Mammogra- 
phy is an effective technique for detecting breast 
cancer in its early stages 3.. Once a massive le- 
sion is detected on a mammogram, the radiologist 
recommends further investigations, depending on 
the likelihood of malignancy he assigns to the le- 
sion. However, the characterization of massive 
lesions merely on the basis of a visual analysis of 
the mammogram is a very difficult task and a high 
number of unnecessary biopsies are actually per- 
formed in the routine clinical activity. The rate 
of positive findings for cancers at biopsy ranges 
from 15% to 30% [1], i.e. the specificity in diffe- 
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rentiating malignant from benign lesions merely 
on the basis of the radiologist's interpretation of 
mammograms is rather low. Methods to improve 
mammographic specificity without missing can- 
cer have to be developed. Computerized method 
have recently shown a great potential in assisting 
radiologists in the visual diagnosis of the lesions, 
by providing them with a second opinion about 
the degree of malignancy of a lesion |5l6l7j . 

The computerized system for the classification 
of benign and malignant massive lesions we de- 
scribe in this paper is a semi-automated one, 
i.e. it provides a likelihood of malignancy for a 
physician-selected region of a mammogram. 

This paper is structured as follows: a techni- 
cal description of the method is given in sec. [ll 
section [2] illustrates the dataset of mammograms 
we used for this study and sec. [3] reports on the 
performances the CADi system achieved in diffe- 
rentiating malignant from benign massive lesions. 
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1. Description of the CADi system 

The system for characterizing massive lesions 
we realized is based on a three-stage algorithm: 
first, a segmentation technique extracts the mas- 
sive lesion from the image; then, several features 
based on size, shape and texture of the lesion are 
computed; finally, a neural classifier merges the 
features into a likelihood of malignancy for that 
lesion. 

1.1. Segmentation 

Massive lesions are extremely variable in size, 
shape and density; they can exhibit a very poor 
image contrast or can be highly connected to the 
surrounding parenchymal tissue. For these rea- 
sons segmenting massive lesions from the nonuni- 
form normal breast tissue is considered a non- 
trivial task and much efforts have already gone 
through this issue |8|9|10|llj . 

The segmentation algorithm we developed is 
an extension and a refinement of the strategy 
proposed in [T^] for the mass segmentation in 
the computerized analysis of breast tumors on 
sonograms. A massive lesion is automatically 
identified within a rectangular Region Of Inter- 
est (ROI) interactively chosen by the radiologist. 
The ROIs contain the lesions as well as a consid- 
erable part of normal tissue. In our segmentation 
procedure the non-tumor regions in a ROI are re- 
moved by applying the following processing steps 
(fig.©: 

• The pixel characterized by the maximum- 
intensity value in the central area of the 
ROI is taken as the seed point for the seg- 
mentation algorithm. 

• A number of radial lines are depicted from 
the seed point to the boundary of the ROI. 

• For each pixel along each radial line the lo- 
cal variance (i.e. the variance of the entries 
of a n X n matrix containing the pixel and 
its neighborhood) is computed. The pixel 
maximizing the local variance is most likely 
the one located on the boundary between 
the mass and the surrounding tissue and it 
is referred as critical point. 



• The critical points determined for each ra- 
dial line are linearly interpolated and a 
coarse boundary for the lesion is deter- 
mined. 

• The pixels inside this coarse region are 
taken as new seed points for iterating the 
procedure in order to end up with a more 
accurate identification of the shape of the 
lesion. 

• A set of candidates to represent the mass 
boundary is thus obtained. Every identi- 
fied point is accepted and the area inside 
the resulting thick border is filled. Once 
the possibly present non-connected objects 
are removed, the thick boundary and the 
area inside it are accepted as the segmented 
massive lesion. 
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Figure 1. Basic schema of the segmentation pro- 
cedure. 
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1.2. Feature extraction 

Once the masses have been segmented out 
of the surromiding normal tissue, a set of sui- 
table features are computed in order to allow 
a decision-making system to distinguish benign 
from malignant lesions [13114115116] . The degree 
of malignancy of a lesion is generally correlated 
to the appearance of arms and spiculations on 
the mass boundary. The more irregular the mass 
shape, the higher the degree of malignancy pos- 
sibly associated to that lesion. Our CADi sys- 
tem extracts 16 features from the segmented le- 
sions: the mass area A; the mass perimeter P; 
the circularity C — AttA/P'^; the mean and the 
standard deviation of the normalized radial length 
(i.e. the Euclidean distance from the center of 
mass of the segmented lesion to the i**^ pixel on 
the perimeter and normalized to the maximum 
distance for that mass); the radial length en- 
tropy (i.e. a probabilistic measure computed from 
the histogram of the normalized radial length as 
E = — J2k=r logPfc, where Pk is the probabi- 
lity that the normalized radial length is between 
d{i) and d{i) -I- l/A'^bins and A^bins is the number of 
bins the normalized histogram has been divided 
in); the zero crossing (i.e. a count of the number 
of times the radial distance plot crosses the ave- 
rage radial distance); the maximum and the mini- 
mum axis of the lesion; the mean and the standard 
deviation of the variation ratio (i.e. the modulus 
of the variations of the radial lengths from their 
mean value are computed and only those excee- 
ding the value war'max/2, where warmax is the 
maximum variation, are considered as dominant 
variations and averaged); the convexity (i.e. the 
ratio between the mass area and the area of the 
smallest convex containing the mass); the mean, 
the standard deviation, the skewness and the kur- 
tosis of the mass grey-level intensity values. 

The 16 features were chosen with the aim of 
enlightening the spiculation characteristics of the 
lesions. The first 12 features in the above descrip- 
tion are related to the mass shape and have some 
evident correlations with the degree of spicula- 
tion of the lesions. Nevertheless, the remaining 
4 features derived from the grey-level intensity 
distribution of the segmented area also aim at in- 
vestigating the degree of mass spiculation: the 



standard deviation, the skewness and the kurto- 
sis carry out the information about irregularities 
characterizing the mass, whereas the mean value 
accounts for an offset to be referred to these three 
parameters. 

1.3. Classification 

The 16 features extracted from each lesion 
are classified by a standard three-layer feed- 
forward neural network with n input, h hidden 
and two output neurons. A supervised training 
based on the back-propagation algorithm with 
sigmoid activation functions both for the hid- 
den and the output layer has been performed. 
The performances of the training algorithm were 
evaluated according to the 5x2 cross validation 
method [17]. It is the recommended test to be 
performed on algorithms that can be executed 10 
times because it can provide a reliable estimate of 
the variation of the algorithm performances due 
to the choice of the training set. This method 
consists in performing 5 replications of the 2-fold 
cross validation method [18] . At each replication, 
the available data are randomly partitioned into 
2 sets {Ai and Bi for i = 1, . . .5) with an almost 
equal number of entries. The learning algorithm 
is trained on each set and tested on the other 
one. The system performances are given in terms 
of the sensitivity and specificity values, where the 
sensitivity is defined as the true positive fraction 
(fraction of malignant masses correctly classified 
by the system), whereas the specificity as the 
true negative fraction (fraction of benign masses 
correctly classified by the system). In order to 
show the trade off between the sensitivity and 
the specificity, a Receiver Operating Character- 
istic (ROC) analysis has been performed |19|20] . 
The ROC curve is obtained by plotting the true 
positive fraction versus the false positive fraction 
of the cases (1 - specificity), computed while the 
decision threshold of the classifier is varied. Each 
decision threshold results in a corresponding ope- 
rating point on the curve. 

2. Image dataset 

The image dataset used for this study has been 
extracted from the database of mammograms col- 
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lected in the framework of a collaboration be- 
tween physicists from several Italian Universities 
and INFN (Istituto Nazionale di Fisica Nucle- 
are) Sections, and radiologists from several Ital- 
ian Hospitals [21l22j . The mammograms come 
both from screening and from the routine work 
carried out in the participating Hospitals. The 
18 X 24 cm^ mammographic films were digitized 
by a CCD linear scanner (Linotype Hell, Saphir 
X-ray), obtaining images characterized by a 85/^m 
pixel pitch and a 12-bit resolution. The patholo- 
gical images are fully characterized by a consis- 
tent description, including the radiological diag- 
nosis, the histological data and the coordinates of 
the center and the approximate radius (in pixel 
units) of a circle drawn by the radiologists around 
the lesion (truth circle). Mammograms with no 
sign of pathology are stored as normal images 
only after a follow up of at least three years. 

A set of 226 massive lesions were used in this 
study: 109 malignant and 117 benign masses 
were extracted from single- view cranio-caudal or 
lateral mammograms. The dataset we analyzed 
can be considered as representative of the patient 
population that is sent for biopsy under the cur- 
rent clinical criteria. 

3. Results 

The 226 massive lesions were segmented by the 
system and shown to an experienced radiologist, 
whose assistance in accepting or rejecting the pro- 
posed mass contours was essential for the eval- 
uation of the segmentation algorithm efhciency. 
Despite the borders of the massive lesions are 
usually not very sharp in mammographic ima- 
ges, the segmentation procedure we carried out 
leads to a quite accurate identification of the mass 
shapes, as can be noticed in fig. [2l The radiolo- 
gist confirmed only the segmented masses whose 
contour was sufficiently close to that she would 
have drawn by hand on the image. The dataset 
of 226 cases available for our analysis was reduced 
to 200 successfully-segmented masses (95 malig- 
nant and 105 benign masses), thus corresponding 
to an efficiency e = 88.5% for the segmentation 
algorithm. 

Once the 16 features were extracted from each 
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Figure 2. Examples of segmented masses: two 
malignant masses (top) and two benign masses 
(bottom). 



well-segmented mass, 5 different train and 5 diffe- 
rent test sets for the 5x2 cross validation analysis 
were prepared by randomly assigning each of the 
200 vectors of features to the train or test set for 
each of the 5 different trials. The optimization 
of the network performances was obtained in our 
case by assigning 3 neurons to the hidden layer, 
resulting in a final architecture for the net of 16 
input, 3 hidden and 2 output neurons. 

The sensitivity and specificity our learning al- 
gorithm realized on each dataset are shown in 
tab. [TJ As can be noticed, the performances the 
neural classifier achieves are robust, i.e. almost 
independent of the partitioning of the available 
data into the train and test sets. The average per- 
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Table 1 

Evaluation of the performances of the standard 
back-propagation learning algorithm for the neu- 
ral classifier according to the 5x2 cross validation 



method. 


Train Set 


Test Set 


Sens. (%) 


Spec. (%) 


Ai 


Bi 


78.7 


84.9 


Bi 


Ai 


77.1 


84.6 


A2 


B2 


78.7 


79.3 


B2 


A2 


77.1 


78.9 


^3 


Bs 


80.9 


72.9 


^3 


A3 


77.1 


78.7 


^4 


Bi 


79.2 


77.4 


Bi 


Ai 


78.7 


78.9 


As 


B5 


75.6 


81.8 


^5 


A5 


78.0 


74.0 



formances achieved in the testing phase are 78.1% 
for the sensitivity and 79.1% for the specificity. 
The ROC curve obtained in the classification of 
the test set B2, containing 100 patterns derived 
from 47 malignant masses and 53 benign masses, 
is reported in fig.[3l To this curve belongs the ope- 
rating point whose values of sensitivity (78.7%) 
and specificity (79.3%) are closer to the average 
values achieved on the test sets by the ten diffe- 
rent neural networks, as shown in tab.[T] As the 
classifier performances are conveniently evaluated 
in terms of the area A^ under the ROC curve, we 
have estimated this parameter for the curve plot- 
ted in fig. El obtaining A, = 0.80 ± 0.04, where 
the standard error has been evaluated according 
to the formula given by Hanley and McNeil |20j . 

4. Conclusions and discussion 

The system for the classification of mammo- 
graphic massive lesions into malignant and benign 
we realized aims at improving the radiologist's vi- 
sual diagnosis of the degree of the lesion malig- 
nancy. The system is a semi-automated one, i.e. 
it segments and analyzes lesions from physician- 
located rectangular ROIs. As mass segmentation 
plays a key role in such kind of systems, most 
efforts have been devoted to the realization of a 
robust segmentation technique. It is based on 
edge detection and it works with a comparable 



1-0- 

0,8- 

.&0.6- / 
'« / 

ED 0.4- / 

CO ; 

0.2 - I 

0.0-1 1 1 1 1 1 1 1 1 1 r- 

0.0 0.2 0.4 0.6 0,8 1.0 

1 - Specificity 



Figure 3. ROC curve obtained in the classifica- 
tion of the test set B2 (see tab. [T]). 



efficiency both on malignant and benign masses. 
The segmentation efficiency e — 88.5% has been 
evaluated with the assistance of an experienced 
radiologist who accepted or rejected each seg- 
mented mass. With respect to a number of au- 
tomated or semi-automated systems with a simi- 
lar purpose and using a similar approach already 
discussed in the literature |8I9I10I11] , the system 
we present is characterized by a robust segmen- 
tation technique: it is based on an edge-detection 
algorithm completely free from any application- 
dependent parameter. 

Sixteen features based on shape, size and inten- 
sity have been extracted out of each segmented 
area and merged by a neural decision-making 
system. The neural network performances have 
been evaluated in terms of the ROC analysis, ob- 
taining an estimated area under the ROC curve 
Az = 0.80 ± 0.04. It gives the indication that the 
segmentation procedure we developed provides a 
quite accurate approximation of the mass shapes 
and that the features wc took into account for the 
classification have a good discriminating power. 
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